feat(server): central audited chokepoint for customer-data drops + CI guard (truehomie hardening) by mastermanas805 · Pull Request #50 · InstaNode-dev/provisioner

mastermanas805 · 2026-06-06T13:36:18Z

What & why

Addresses the OPEN root-cause class of the truehomie-db DROP incident (2026-06-03): an active Pro customer's Postgres DB + role were dropped on the shared postgres-customers cluster by an unidentified path with NO audit_log row. The provisioner was a dumb executor — DeprovisionResource dropped whatever (token, provider_resource_id, resource_type) it received and kept no record of its own.

Full enumeration + design + ranked root-cause hypotheses: docs/ci/DATA-INTEGRITY-DROP-PATH-AUDIT.md (docs repo).

Changes

guardedDrop chokepoint (internal/server/drop_chokepoint.go) — the single sanctioned wrapper for a customer-data destruction. All 8 backend Deprovision dispatches in DeprovisionResource (postgres/redis/mongo/queue × shared/dedicated) now route through it. It emits, before the drop:
- a structured event=provisioner.drop audit log line: token, provider_resource_id, resource_type, backend, request_id, caller (gRPC peer addr — the attribution that was missing in the incident). Always-on, app-layer equivalent of the cluster log_statement='ddl' trap.
- instant_provisioner_drop_total{resource_type,backend,outcome} (eager) — an abnormal drop rate is now alertable (the incident was a burst of un-attributed drops).
CI guard (internal/server/drop_guard_test.go) — AST-iterating (rule 18). Walks the provisioner source and fails the build if any DROP DATABASE/ROLE/USER SQL literal, ACL DELUSER, mongo dropDatabase, or Database(...).Drop(...) call appears outside a sanctioned deprovision function reached through guardedDrop. A new un-audited drop path cannot merge. Proven non-vacuous (TestDropGuard_FlagsUnsanctionedSite) and comment-safe (TestDropGuard_IgnoresComments); does not flag Collection.Drop (sentinel cleanup, not customer data).

Surgical / safe

Behaviour of what gets purged in the TTL reaper / team-deletion / user-DELETE flows is unchanged — the chokepoint only ADDs the audit + metric around the existing dispatch. The breaker wrapping is preserved. No proto change, no platform-DB access, no flag needed.

The larger fix for invariant (3) — a DeprovisionRequest.deletion_intent proto field + provisioner terminality enforcement (needs platform-DB read, fail-closed) — is designed + filed, not rushed, to avoid breaking the legitimate purges that drop active rows by design.

Verification

make gate: green (build + vet + go test ./... -short -count=1 -p 1, all packages).
New tests (9): chokepoint ok/error/breaker-open outcomes + metric assertions, caller attribution, and the CI guard (real-tree scan + synthetic positive + comment negative).
Rule 25: alert + dashboard tile + catalog row for instant_provisioner_drop_total ship in the matching infra PR.

🤖 Generated with Claude Code

… guard Addresses the OPEN truehomie-db DROP incident root-cause class: an active customer's DB+role were dropped by an unidentified path with NO audit trail. The provisioner was a dumb executor — DeprovisionResource dropped whatever token it was handed and kept no record of its own. - guardedDrop (drop_chokepoint.go): the single sanctioned wrapper for a customer-data destruction. Every backend Deprovision dispatch in DeprovisionResource now routes through it. Emits a structured `event=provisioner.drop` audit log line (token, provider_resource_id, resource_type, backend, request_id, caller from gRPC peer) BEFORE the drop + instant_provisioner_drop_total{resource_type,backend,outcome}. This is the always-on, app-layer equivalent of the cluster's log_statement='ddl' trap. - drop_guard_test.go: AST-iterating CI guard (rule 18) that FAILS the build if any DROP DATABASE/ROLE/USER SQL literal, ACL DELUSER, mongo dropDatabase, or Database().Drop() call appears outside a sanctioned deprovision function reached through guardedDrop. A new un-audited drop path cannot merge. Proven non-vacuous: a synthetic un-sanctioned DROP is flagged; a commented DROP and a Collection.Drop (not customer data) are not. Behaviour of WHAT gets purged in the TTL reaper / team-deletion / user-DELETE flows is unchanged — the chokepoint only ADDs the audit line + metric around the existing dispatch. The larger deletion-intent proto field + terminality enforcement (needs platform-DB read, flag-gated) is designed + filed in docs/ci/DATA-INTEGRITY-DROP-PATH-AUDIT.md. make gate: green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…002) staticcheck QF1002: prefer errors.Is over == for error comparison; also more correct for a potentially-wrapped circuit.ErrOpen. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…migration-safety CI gate (truehomie D3) (#56) * feat(dropguard): name-convention guard on every customer-data drop + migration-safety CI gate (truehomie D3) Layer 2 of the truehomie-db DROP hardening (layer 1 = the guardedDrop audit chokepoint, PR #50). The chokepoint made every sanctioned drop AUDITABLE; this makes a MIS-TARGETED drop UNEXECUTABLE: - internal/dropguard: charset+denylist validation of naming tokens and final db/user identifiers. Refuses empty/garbage tokens, SQL metacharacters, system databases (postgres, template0/1, instant_customers, instant_platform, mongo admin/local/config) and admin roles (instanode_admin, instant_cust, doadmin, postgres, redis "default"). Deliberately permissive on token SHAPE (uuid, dashless hex, pool-*, e2e cohorts all pass) so no legitimate legacy deprovision can wedge — this repo auto-deploys. - server.guardedDrop: validates the poolident-resolved naming token BEFORE dispatch; refusal = error + `provisioner.drop.refused` log event + instant_provisioner_drop_total{outcome="refused"}. - postgres local Deprovision + cleanupProvisionPartial (non-chokepoint rollback path), dedicated deprovisionLocal: validate the FINAL constructed dbName/username via validateDropTargets right before DROP. - mongo Deprovision: token guard before connect + per-candidate name guard (refused canonical = hard error; refused legacy candidate = skip). - redis local/dedicated DELUSER: skip any non-tenant-shaped username (ACL DELUSER "default" would brick the shared pod). - scripts/check-migration-safety.sh + ci.yml step: destructive DDL (DROP TABLE/DATABASE/SCHEMA/COLUMN/ROLE/USER, TRUNCATE, DELETE FROM without WHERE) in migrations/*.sql fails CI unless the file carries `-- destructive: acknowledged <reason>`. Self-test fixture suite included. No behavior change for legitimate deprovision flows (regression tests assert uuid/dashless/pool/e2e token shapes still reach the backend). New refusal branches are unit-tested in every package; dropguard itself is 100% covered. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat(pool): chokepoint contract on the pool reaper's deprovision dispatch pool.deprovisionBacking is the ONE customer-infra drop dispatch that does not pass through server.guardedDrop (no gRPC request). Give it the same contract: dropguard naming-token validation (refusal = provisioner.drop.refused) and the `provisioner.drop` audit event emitted BEFORE the backend executes, with caller="pool_reaper" and the proto-style resource_type strings so one NR query covers RPC drops and pool-reap drops alike. Without this, every pool reap of a failed postgres item would page the new customer-db-destructive-ddl NR alert as an unsanctioned DROP (and was itself an un-audited drop path). Test fixture fix: fakeRows.Scan filled every string dest with "postgres", which dropguard now (correctly) refuses as a pool naming token — fill positionally so pool_token carries a valid pool-token shape. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(dropguard): identifier length bound is an absurdity bound (128), not postgres's 63 The live-cluster CI suite provisions tokens like tok<nano>_<TestName> (>63 bytes with the usr_ prefix) — postgres truncates such identifiers CONSISTENTLY on CREATE and DROP, so they round-trip fine in reality, and the 63-byte refusal wedged their Deprovision (4 coverage-job test failures). Keep the cap purely as an absurdity bound at 128. Verified the four failing live tests pass against a real postgres 16. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> --------- Co-authored-by: Manas Srivastava <[email protected]> Co-authored-by: Claude Fable 5 <noreply@anthropic.com>

mastermanas805 enabled auto-merge (squash) June 6, 2026 13:37

fix(server): use errors.Is for breaker-open check in dropOutcome (QF1…

f850351

…002) staticcheck QF1002: prefer errors.Is over == for error comparison; also more correct for a potentially-wrapped circuit.ErrOpen. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

mastermanas805 merged commit 11475eb into master Jun 6, 2026
12 checks passed

mastermanas805 mentioned this pull request Jun 10, 2026

feat(dropguard): name-convention guard on every customer-data drop + migration-safety CI gate (truehomie D3) #56

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(server): central audited chokepoint for customer-data drops + CI guard (truehomie hardening)#50

feat(server): central audited chokepoint for customer-data drops + CI guard (truehomie hardening)#50
mastermanas805 merged 2 commits into
masterfrom
feat/drop-chokepoint-audit-guard

mastermanas805 commented Jun 6, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mastermanas805 commented Jun 6, 2026

What & why

Changes

Surgical / safe

Verification

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant